NSF PAR Search | NSF Public Access Repository

Automatic Tracing in Task-Based Runtime Systems

https://doi.org/10.1145/3669940.3707237

Yadav, Rohan; Bauer, Michael; Broman, David; Garland, Michael; Aiken, Alex; Kjolstad, Fredrik (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
Composing Distributed Computations Through Task and Kernel Fusion

https://doi.org/10.1145/3669940.3707216

Yadav, Rohan; Sundram, Shiv; Lee, Wonchan; Garland, Michael; Bauer, Michael; Aiken, Alex; Kjolstad, Fredrik (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
Quanto: optimizing quantum circuits with automatic generation of circuit identities

https://doi.org/10.1088/2058-9565/ad5b16

Pointing, Jessica; Padon, Oded; Jia, Zhihao; Ma, Henry; Hirth, Auguste; Palsberg, Jens; Aiken, Alex (July 2024, Quantum Science and Technology)

Existing quantum compilers focus on mapping a logical quantum circuit to a quantum device and its native quantum gates. Only simple circuit identities are used to optimize the quantum circuit during the compilation process. This approach misses more complex circuit identities, which could be used to optimize the quantum circuit further. We propose Quanto, the first quantum optimizer that automatically generates circuit identities. Quanto takes as input a gate set and generates provably correct circuit identities for the gate set. Quanto’s automatic generation of circuit identities includes single-qubit and two-qubit gates, which leads to a new database of circuit identities, some of which are novel to the best of our knowledge. In addition to the generation of new circuit identities, Quanto’s optimizer applies such circuit identities to quantum circuits and finds optimized quantum circuits that have not been discovered by other quantum compilers, including IBM Qiskit and Cambridge Quantum Computing Tket. Quanto’s database of circuit identities could be applied to improve existing quantum compilers and Quanto can be used to generate identity databases for new gate sets.
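To make the notion of a circuit identity concrete, the following is a minimal sketch that checks whether two single-qubit gate sequences implement the same unitary. It is not Quanto's generation or proof procedure: it is a plain numerical comparison with a tolerance, it does not account for global phase, and the helper names (`circuit_matrix`, `is_identity_pair`) are hypothetical.

```python
import math

# Standard single-qubit gates as 2x2 matrices.
s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

def matmul(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def circuit_matrix(gates):
    """Unitary of a circuit whose gates are listed in order of application."""
    m = [[1, 0], [0, 1]]
    for g in gates:
        m = matmul(g, m)  # a later gate multiplies from the left
    return m

def is_identity_pair(c1, c2, tol=1e-9):
    """Hypothetical helper: True if both circuits give the same matrix.

    Assumption: exact equality up to `tol`; global phase is not ignored.
    """
    m1, m2 = circuit_matrix(c1), circuit_matrix(c2)
    return all(abs(m1[i][j] - m2[i][j]) <= tol
               for i in range(2) for j in range(2))

# The textbook identity H Z H = X replaces three gates with one.
print(is_identity_pair([H, Z, H], [X]))
```

A numerical check like this can only screen candidate identities; the abstract describes identities that are provably correct, which requires a stronger verification step than the one sketched here.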
Full Text Available
Legate Sparse: Distributed Sparse Computing in Python

https://doi.org/10.1145/3581784.3607033

Yadav, Rohan; Lee, Wonchan; Elibol, Melih; Papadakis, Manolis; Lee-Patti, Taylor; Garland, Michael; Aiken, Alex; Kjolstad, Fredrik; Bauer, Michael (November 2023, ACM)

Full Text Available
SpDISTAL: Compiling Distributed Sparse Tensor Computations

https://doi.org/10.1109/SC41404.2022.00064

Yadav, Rohan; Aiken, Alex; Kjolstad, Fredrik (November 2022, IEEE/ACM)

We introduce SpDISTAL, a compiler for sparse tensor algebra that targets distributed systems. SpDISTAL combines separate descriptions of tensor algebra expressions, sparse data structures, data distribution, and computation distribution. Thus, it enables distributed execution of sparse tensor algebra expressions with a wide variety of sparse data structures and data distributions. SpDISTAL is implemented as a C++ library that targets a distributed task-based runtime system and can generate code for nodes with both multi-core CPUs and multiple GPUs. SpDISTAL generates distributed code that achieves performance competitive with hand-written distributed functions for specific sparse tensor algebra expressions and that outperforms general interpretation-based systems by one to two orders of magnitude.
Full Text Available
Quartz: superoptimization of Quantum circuits

https://doi.org/10.1145/3519939.3523433

Xu, Mingkuan; Li, Zikun; Padon, Oded; Lin, Sina; Pointing, Jessica; Hirth, Auguste; Ma, Henry; Palsberg, Jens; Aiken, Alex; Acar, Umut A.; et al (June 2022, PLDI 2022: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation)

Full Text Available
Development of a discontinuous Galerkin solver using Legion for heterogeneous high-performance computing architectures

https://doi.org/10.2514/6.2021-0140

Bando, Kihiro; Brill, Steven; Slaughter, Elliott; Sekachev, Michael; Aiken, Alex; Ihme, Matthias (January 2021, AIAA Scitech 2021 Forum)

This work discusses the development, verification and performance assessment of a discontinuous Galerkin solver for the compressible Navier-Stokes equations using the Legion programming system. This is motivated by (i) the potential of this family of high-order numerical methods to accurately and efficiently realize scale-resolving simulations on unstructured grids and (ii) the desire to accommodate the utilization of emerging compute platforms that exhibit increased parallelism and heterogeneity. As a task-based programming model specifically designed for performance portability across distributed heterogeneous architectures, Legion represents an interesting alternative to the traditional approach of using Message Passing Interface for massively parallel computational physics solvers. Following detailed discussion of the implementation, the high-order convergence of the solver is demonstrated by a suite of canonical test cases and good strong scaling behavior is obtained. This work constitutes a first step towards a research platform that is able to be deployed and efficiently run on modern supercomputers.
Full Text Available
Beyond Data and Model Parallelism for Deep Neural Networks

Jia, Zhihao; Zaharia, Matei; Aiken, Alex (April 2019, SysML 2019)

Existing deep learning systems commonly parallelize deep neural network (DNN) training using data or model parallelism, but these strategies often result in suboptimal parallelization performance. We introduce SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operator, Attribute, and Parameter dimensions. We present FlexFlow, a deep learning engine that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy’s performance and is three orders of magnitude faster than prior approaches that execute each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that FlexFlow increases training throughput by up to 3.3× over state-of-the-art approaches, even when including its search time, and also improves scalability.
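The idea of a guided randomized search driven by a simulator can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the cost model in `simulate`, the per-operator work values, the set of parallelism degrees, and the Metropolis-style acceptance rule are hypothetical stand-ins, not FlexFlow's simulator or search procedure.

```python
import math
import random

# Hypothetical workload: compute cost of four operators in a chain.
WORK = [8.0, 16.0, 4.0, 12.0]
# Hypothetical choices for how many devices each operator is split across.
DEGREES = [1, 2, 4, 8]

def simulate(strategy):
    """Toy cost model: ideal compute speedup plus a fixed communication
    penalty whenever adjacent operators use different degrees."""
    compute = sum(w / d for w, d in zip(WORK, strategy))
    comm = sum(1.0 for a, b in zip(strategy, strategy[1:]) if a != b)
    return compute + comm

def search(initial, steps=2000, beta=2.0, seed=0):
    """Randomized search: propose a change to one operator, always accept
    improvements, and accept regressions with a probability that shrinks
    as the regression grows."""
    rng = random.Random(seed)
    current, current_cost = list(initial), simulate(initial)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        proposal = list(current)
        proposal[rng.randrange(len(proposal))] = rng.choice(DEGREES)
        cost = simulate(proposal)
        accept = cost <= current_cost or \
            rng.random() < math.exp(beta * (current_cost - cost))
        if accept:
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = list(proposal), cost
    return best, best_cost

start = [1, 1, 1, 1]
best, best_cost = search(start)
print(simulate(start), best, best_cost)
```

Because each candidate is scored by calling `simulate` rather than by running it, the loop can afford thousands of proposals, which is the role the abstract assigns to the execution simulator.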
Full Text Available
TASO: optimizing deep learning computation with automatic generation of graph substitutions

https://doi.org/10.1145/3341301.3359630

Jia, Zhihao; Padon, Oded; Thomas, James; Warszawski, Todd; Zaharia, Matei; Aiken, Alex (October 2019, SOSP)

Existing deep neural network (DNN) frameworks optimize the computation graph of a DNN by applying graph transformations manually designed by human experts. This approach misses possible graph optimizations and is difficult to scale, as new DNN operators are introduced on a regular basis. We propose TASO, the first DNN computation graph optimizer that automatically generates graph substitutions. TASO takes as input a list of operator specifications and generates candidate substitutions using the given operators as basic building blocks. All generated substitutions are formally verified against the operator specifications using an automated theorem prover. To optimize a given DNN computation graph, TASO performs a cost-based backtracking search, applying the substitutions to find an optimized graph, which can be directly used by existing DNN frameworks. Our evaluation on five real-world DNN architectures shows that TASO outperforms existing DNN frameworks by up to 2.8X, while requiring significantly less human effort. For example, TensorFlow currently contains approximately 53,000 lines of manual optimization rules, while the operator specifications needed by TASO are only 1,400 lines of code.
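A cost-based search over substitutions can be sketched on a toy problem. This is an illustration under stated assumptions, not TASO's implementation: graphs are simplified to linear chains of operator names, the `COST` table and `RULES` list are hypothetical, and the `alpha` pruning threshold and `budget` limit are parameters chosen for this sketch.

```python
import heapq

# Hypothetical per-operator costs.
COST = {"conv": 10, "relu": 2, "conv_relu": 10, "identity": 1, "add": 3}

# Hypothetical substitutions: (pattern, replacement) over operator chains.
RULES = [
    (("conv", "relu"), ("conv_relu",)),  # fuse an activation into a conv
    (("identity",), ()),                 # drop a no-op
]

def cost(graph):
    """Total cost of a chain of operators."""
    return sum(COST[op] for op in graph)

def neighbors(graph):
    """Yield every graph reachable by applying one rule at one position."""
    for pattern, replacement in RULES:
        n = len(pattern)
        for i in range(len(graph) - n + 1):
            if graph[i:i + n] == pattern:
                yield graph[:i] + replacement + graph[i + n:]

def optimize(graph, alpha=1.05, budget=1000):
    """Explore candidates cheapest-first, keeping any candidate whose cost
    is below alpha times the best cost seen so far."""
    best = graph
    queue = [(cost(graph), graph)]
    seen = {graph}
    while queue and budget > 0:
        budget -= 1
        c, g = heapq.heappop(queue)
        if c < cost(best):
            best = g
        for h in neighbors(g):
            if h not in seen and cost(h) < alpha * cost(best):
                seen.add(h)
                heapq.heappush(queue, (cost(h), h))
    return best

result = optimize(("conv", "relu", "identity", "add"))
print(result, cost(result))
```

Setting `alpha` above 1 lets the search keep candidates that are slightly worse than the current best, so a substitution that only pays off after a later one is not discarded immediately.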
Full Text Available
Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance

https://doi.org/10.1109/SC41405.2020.00066

Slaughter, Elliott; Wu, Wei; Fu, Yuankun; Brandenburg, Legend; Garcia, Nicolai; Kautz, Wilhem; Marx, Emily; Morris, Kaleb S.; Cao, Qinglei; Bosilca, George; et al (November 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis)

Full Text Available
